Statistik für Biologie FS 2022

Solutions

Solutions to the exercises will be posted here each week.


Week 1

Exercise 1

  1. No. Collectors did not choose randomly, but preferred rarer types over more common ones.

  2. It is a sample of convenience.

  3. There is bias in every year’s sample because rare types are over-represented. Furthermore, this bias might change across years, as the frequencies of the two morphs have changed over time.

Exercise 2

  1. Discrete.

  2. Technically it is a discrete variable (in any finite sample, only finitely many fraction values can occur), but it makes sense to treat it as a continuous variable if the sample is very large.

  3. Discrete. There are no half crimes.

  4. Continuous. Body mass is continuous and hence the log as well.

Exercise 3

  1. E(xplanatory): Altitude (categorical: high vs low) R(esponse): Growth rate S(tudy type): Observational

  2. E: Treatment (standard vs. tasploglutide) R: Rate of insulin release S: Experimental

  3. E: Health status (schizophrenia vs. healthy) R: Frequency of illegal drug use S: Observational

  4. E: Number of legs R: Survival probability S: Experimental

  5. E: Treatment (advanced communication therapy vs. social visits without formal therapy) R: Communication ability S: Experimental

Exercise 4

The main problem is a strong bias in the sample. To see this, consider the sample of planes that were used in this study. Only planes that were not hit in critical areas were available to estimate the distribution of bullet holes. Planes that were hit in a critical area, i.e., one that leads to a crash, were not available because they did not return to base. With this knowledge, it becomes clear that it would have been better to reinforce the areas where no, or very few, bullet holes were found, namely the cockpit and engine.

Exercise 5

  1. The population from which the sample is drawn is all the small mammals of Kruger National Park; the sample is used to estimate properties (such as the number of species) of this population.

  2. No, the sample is not likely to be random. In a random sample, every individual has the same chance of being selected. But some small mammals might be easier to trap than others (for example, trapping only at night might miss all the mammals active only in the day time). In a random sample individuals are selected independently. Multiple animals caught in the same trap might not be independent if they are related or live near one another (this is harder to judge).

  3. The number of species in the sample might underestimate the number in the Park if sampling was not random (e.g., if daytime mammals were missed), or if rare species happened to avoid capture. Even if the sample is perfectly random, the estimator will underestimate the true number in most cases: you can sample fewer species just by chance, but never more species than there actually are. Thus, on average, you will underestimate the true number.
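This downward bias can be illustrated with a small simulation (a sketch with made-up community frequencies, not data from the exercise):

```r
# hypothetical community of 10 species with unequal frequencies
freqs <- c(0.3, 0.2, 0.15, 0.1, 0.08, 0.06, 0.05, 0.03, 0.02, 0.01)

set.seed(42)

# draw 1000 random samples of 30 individuals and count the species in each
n.species <- replicate(1000, {
  caught <- sample(1:10, size = 30, replace = TRUE, prob = freqs)
  length(unique(caught))
})

mean(n.species)  # on average below the true number of 10
max(n.species)   # a sample can never contain more than 10 species
```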

Week 2

Exercise 6

  1. You should see a blank plot. This is what ggplot does, it simply creates a “canvas” to which you can add “layers”.

  2. How many rows are in mpg? How many columns?

str(mpg)
## tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

There are 234 observations of 11 variables. Thus, there are 11 columns and 234 rows. We can confirm this by looking at the dimensions of the underlying data matrix:

dim(mpg)
## [1] 234  11
?mpg

drv is a categorical variable indicating the drive type: f = front-wheel drive, r = rear-wheel drive, 4 = 4wd.

ggplot(data  = mpg) + 
  geom_point(mapping = aes(y = hwy,x = cyl)) + 
  theme_classic() + 
  labs(title = "A scatterplot", y = "miles per gallon on highways",x="cylinders")

ggplot(data  = mpg) + 
  geom_point(mapping = aes(x = class,y = drv)) + 
  theme_classic() + 
  labs(title = "A useless plot", x = "class",y="drive type")

Both drv and class are categorical variables, so it does not make sense to visualize them this way. What would be a better way to show these data?
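One option (a sketch) is geom_count, which maps the number of observations in each combination to point size:

```r
library(ggplot2)

# point size now shows how many cars fall into each class/drive combination
ggplot(data = mpg) + 
  geom_count(mapping = aes(x = class, y = drv)) + 
  theme_classic() + 
  labs(x = "class", y = "drive type")
```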

  1. Color is used within the aesthetics function, where we specify which variables should be used. “blue” is, however, not a variable in our data frame, so it is ignored. The following code creates the correct plot:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  1. A continuous variable will lead to a continuous color gradient rather than a discrete set of colors.
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = cty, y = hwy, color = displ))

  1. The “+” symbol always has to go at the end of a line, never at the beginning of the next one:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Week 3

Exercise 7

genes <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter04/chap04e1HumanGeneLengths.csv"))

# a) 

n = dim(genes)[1]
print(n)
## [1] 20290
# b)    
sum.gene.lengths = sum(genes$geneLength)

# c)      

mean.gene.length1 = sum.gene.lengths/n
mean.gene.length2 = mean(genes$geneLength)

print(mean.gene.length1)
## [1] 2622.027
print(mean.gene.length2)
## [1] 2622.027
# d)
square.sum.gene.lengths = sum(genes$geneLength^2)
print(square.sum.gene.lengths)
## [1] 223678143206
# e)    

# by hand:
variance.gene.length = sum((genes$geneLength - mean.gene.length1)^2)/(n-1)

print(variance.gene.length)
## [1] 4149235
# for comparison:
var(genes$geneLength)
## [1] 4149235
# f)    
sd(genes$geneLength)
## [1] 2036.967
# or 

sd.gene.length = sqrt(variance.gene.length)

# g)     

cv.gene.length = sd.gene.length/mean.gene.length1
print(cv.gene.length)
## [1] 0.7768672
# h)    
# Which one do you like best?

ggplot(data = genes) + 
  geom_histogram(mapping = aes(x = geneLength),color="black",fill="darkred") + 
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = genes) + 
  geom_density(mapping = aes(x = geneLength),color="black",fill="darkred") + 
  xlab("Länge der Gene") +
  ylab("Häufigkeit") +
  scale_x_log10() +
  theme_classic()

ggplot(data = genes) + 
  geom_boxplot(mapping = aes(y = geneLength),color="black",fill="darkred") + 
  theme_classic() + 
  coord_flip()

Week 4

Exercise 8

# load the data file:
stickleback <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter03/chap03e3SticklebackPlates.csv"))

# one possible solution:

mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.mM = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

median.mm = median(stickleback$plates[stickleback$genotype=="mm"])
median.mM = median(stickleback$plates[stickleback$genotype=="Mm"])
median.MM = median(stickleback$plates[stickleback$genotype=="MM"])

par(mfrow = c(1,3))

hist(stickleback$plates[stickleback$genotype=="MM"],
     main = "MM",xlab = "number of plates",
     col="darkred")

abline(v = mean.MM,col="black")
abline(v = median.MM,col="blue")

hist(stickleback$plates[stickleback$genotype=="Mm"],
     main = "Mm",xlab = "number of plates",
     col="darkred")
abline(v = mean.mM,col="black")
abline(v = median.mM,col="blue")

hist(stickleback$plates[stickleback$genotype=="mm"],
     main = "mm",xlab = "number of plates",
     col="darkred")
abline(v = mean.mm,col="black")
abline(v = median.mm,col="blue")

# alternatively, we can first create a data frame 
# with the means per genotype
# check out the help function of ddply
# (this is only one way to do this, you can also do it by 
# "hand" or using other functions such as tapply) :

library(plyr)
df.mean = ddply(stickleback, "genotype", summarize, m.number = mean(plates))
df.median = ddply(stickleback, "genotype", summarize, m.number = median(plates))

# then we plot it with a single ggplot
ggplot() +
  geom_histogram(data = stickleback, mapping = aes(x=plates),binwidth=5,color="black",fill="darkred") +
  geom_vline(data = df.mean, aes(xintercept=m.number),col="black",size=1) +
  geom_vline(data = df.median, aes(xintercept=m.number),col="blue",size = 1) +
  facet_wrap(~ genotype) +
  ylab("frequency") +
  theme_classic()

Exercise 9

The code simulates the sampling distribution of the mean. Here is an alternative version for the Poisson distribution:

# We choose the Poisson distribution with mean lambda = 50
lambda = 50

# We do 1000 replicates
reps = 1000

# We choose the sample size n to be 10
sample_size = 10

# the means of each sample will be stored in this vector
means = vector("numeric",reps)

# in this loop, we calculate the mean of each sample
for (i in 1:reps)
{
  # take a sample of the poisson distribution
  sample = rpois(sample_size,lambda)
  
  # store the mean in our vector
  means[i] = mean(sample)
}

# a histogram - note that we set freq = FALSE to 
# show relative frequencies 
hist(means,freq=F,main="Sampling distribution of the mean (Poisson)")

# we specify the x values for the theoretical curve
x = seq(0,3*lambda,by=0.1)

# we calculate the standard error:
SE = sqrt(lambda)/sqrt(sample_size)

# calculate the normal approximation of the sampling
# distribution (central limit theorem)
y = dnorm(x,lambda,SE)

lines(x,y)
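We can also check numerically that the spread of the simulated means matches the theoretical standard error \(\sqrt{\lambda}/\sqrt{n}\) (rerunning the simulation compactly):

```r
lambda <- 50
sample_size <- 10

set.seed(1)
means <- replicate(1000, mean(rpois(sample_size, lambda)))

sd(means)                       # empirical spread of the sample means
sqrt(lambda)/sqrt(sample_size)  # theoretical standard error (about 2.24)
```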

Week 5

Exercise 10

  1. What is the probability that a randomly drawn slice has pepperoni on it?

\[P(pepperoni) = 5/8 = 0.625 = 62.5 \%\]

  1. What is the probability that a randomly drawn slice has both pepperoni and anchovies on it?

\[P(pepperoni \quad and \quad anchovies) = 2/8 = 0.25 = 25 \%\]

  1. What is the probability that a randomly drawn slice has either pepperoni or anchovies on it?

\[P(pepperoni \quad or \quad anchovies) = 7/8 = 0.875 = 87.5\%\]

  1. Are pepperoni and anchovies mutually exclusive on this pizza?

No. There are two slices with both pepperoni and anchovies.

  1. Are olives and mushrooms mutually exclusive on this pizza?

Yes. There is no slice that has both olives and mushrooms on it.

  1. Are getting mushrooms and getting anchovies independent when randomly picking a slice of pizza?

No, because

\[P(anchovy) = 4/8 = 0.5\]

\[P(mushroom) = 3/8 = 0.375\]

and

\[P(mushroom \quad and \quad anchovy) = 1/8\]

but

\[P(mushroom)*P(anchovy) = 3/8 * 4/8 = 3/16 = 0.1875 = 18.75 \% .\]

  1. If I pick a slice from this pizza and tell you that it has olives on it, what is the chance that it also has anchovies?

\[P(anchovy \mid olive) =\] \[P(anchovy \quad and \quad olive)/P(olive) = (1/8) / (2/8) = 1/2 = 0.5 = 50\%\]

  1. If I pick a slice from this pizza and tell you that it has anchovies on it, what is the chance that it also has olives?

\[P(olive \mid anchovy) =\] \[P(olive \quad and \quad anchovy)/P(anchovy) = (1/8) / (4/8) = 1/4 = 0.25 = 25\%\]

  1. Seven of your friends each choose a slice at random and eat them without telling you what toppings they had. What is the chance that the last slice has olives on it?

\[P(last \quad slice \quad has \quad olives) = 2/8 = 0.25 = 25\%\].

This can be seen directly because each slice has the same probability to be the last slice.

  1. You choose two slices at random. What is the chance that they have both olives on them? (Be careful: after removing the first slice, the probability of choosing one of the remaining slices changes.)

\[P(first \, slice \,has \, olives) = 2/8\]

\[P(second \, slice \, also \, has \, olives) = 1/7\]

\[P(both \, slices \, have \, olives) = 2/8 * 1/7 = 1/28 \approx 3.6\% \]
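We can verify this by simulation (a sketch in which the eight slices are encoded only by whether they carry olives):

```r
# 8 slices, 2 of which have olives
slices <- c(TRUE, TRUE, rep(FALSE, 6))

set.seed(1)

# draw two slices without replacement, many times
both.olives <- replicate(100000, all(sample(slices, 2)))

mean(both.olives)  # close to 1/28 (about 0.036)
```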

  1. What is the probability that a randomly chosen slice does not have pepperoni on it?

\[P(no \, pepperoni) = 1 - P(pepperoni) = 3/8 = 37.5\%\]

  1. Draw a pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive

A pizza for which mushrooms, olives, anchovies and pepperoni are all mutually exclusive

Exercise 11

  1. What is the probability that you picked up no dangerous snakes?

\[P(no \, dangerous \, snakes) = \] \[ P(no \, dangerous \, snake \, in \, left \, hand) P(no \, dangerous \, snake \, in \, right \, hand)\] \[= 3/8 * 2/7 = 6/56 = 0.107 = 10.7 \%\]

  1. Assume that any dangerous snake that you pick up has a probability of 0.8 of biting you. The defanged snakes do not bite. What is the chance that, in picking up your two snakes, you are bitten at least once?

\[P(bite) = P(bite \mid 0 \, dangerous \, snakes) P(0 \, dangerous \, snakes)\] \[+ P(bite \mid 1 \, dangerous \, snake) P(1 \, dangerous \, snake) \] \[+P(bite \mid 2 \, dangerous \, snakes) P(2 \, dangerous \, snakes)\]

\[P(0 \, dangerous \, snakes) = 0.107\] (from (a)) \[P(1 \, dangerous \, snake) = (5/8 \cdot 3/7) + (3/8 \cdot 5/7) = 0.536\] \[P(2 \, dangerous \, snakes) = 5/8 \cdot 4/7 = 0.357\] \[P(bite \mid 0 \, dangerous \, snakes) = 0\] \[P(bite \mid 1 \, dangerous \, snake) = 0.8\] \[P(bite \mid 2 \, dangerous \, snakes) = 1- P(no \, bite \mid 2 \, dangerous \, snakes) = 1-(1- 0.8)^2 = 0.96\]

Putting all this together: \[P(bite) = (0 \cdot 0.107)+(0.8 \cdot 0.536)+(0.96 \cdot 0.357) = 0.772\]
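A quick simulation (a sketch) confirms this value:

```r
# 8 snakes: 5 dangerous (TRUE), 3 defanged (FALSE)
snakes <- c(rep(TRUE, 5), rep(FALSE, 3))

set.seed(1)
bitten <- replicate(100000, {
  picked <- sample(snakes, 2)           # one snake per hand, no replacement
  n.dangerous <- sum(picked)
  # each dangerous snake bites independently with probability 0.8
  any(runif(n.dangerous) < 0.8)
})

mean(bitten)  # close to 0.772
```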

  1. If you picked up one snake and it didn’t bite you, what is the probability that it is defanged?

\[P(defanged \mid no \, bite)= \left[P(no \, bite \mid defanged)P(defanged)\right]/P(no \, bite)\] \[P(no \, bite \mid defanged)=1\] \[P(defanged) = 3/8\]

\[P(no \, bite) = P(defanged)P(no \, bite \mid defanged) + P(dangerous)P(no \, bite \mid dangerous)\] \[= (3/8 \cdot 1) + (5/8 \cdot (1-0.8)) = 0.5\]

Therefore

\[P(defanged \mid no \, bite) = (1 \cdot 3/8)/0.5 = 0.75 = 75\%.\]

Exercise 12

  1. What is the probability that all five researchers have calculated an interval that includes the true value of the parameter?

\(0.95^5 = 0.774\)

  1. What is the probability that at least one does not include the true parameter value.

\(1- 0.95^5 = 0.226\)
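Equivalently, the number of intervals that miss the true value follows a binomial distribution with \(n = 5\) trials and miss probability \(p = 0.05\):

```r
# P(all five intervals cover the true value) = P(zero misses)
0.95^5                                # 0.7737809
dbinom(0, size = 5, prob = 0.05)

# P(at least one interval misses the true value)
1 - dbinom(0, size = 5, prob = 0.05)  # 0.2262191
```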

Week 6

Exercise 13

We first calculate the mean for each genotype:

stickleback = read.csv("~/Dropbox/Teaching/StatisticsForBiology/chap03e3SticklebackPlates.csv")

mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

Next the upper and lower limit of the 95% CI using a t-test:

t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])

CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int

We put it all in a dataframe:

df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
                lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
                upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
                genotypes = c("mm","Mm","MM"))

Now we can plot everything:

ggplot(data = df) +
  geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
  geom_point(mapping = aes(x = genotypes,y=means)) 

If you prefer (I don’t), you can also do a bar plot:

ggplot(data = df) +
  geom_col(mapping = aes(x = genotypes,y=means,fill=genotypes)) +
  geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.1) +
  geom_point(mapping = aes(x = genotypes,y=means))

Either way, it is advised to also add the raw data:

ggplot(data = df)  + 
  geom_jitter(data = stickleback,mapping = aes(x = genotype,y = plates,color=genotype),size=0.8,width=0.05)+
  geom_errorbar(mapping = aes(x = genotypes,ymin=lower, ymax=upper), width=.2) +
  geom_point(mapping = aes(x = genotypes,y=means))

Exercise 14

We first choose values for \(\mu\) and \(\sigma\) (you can choose any values you want):

mu = 10
sigma = 2

Next we choose the values for the x axis. A good choice is to cover the range from \(\mu - 4 \sigma\) to \(\mu + 4 \sigma\). We choose a step size of 0.1:

x = seq(mu-4*sigma,mu+4*sigma,by=0.1)

Now we compute the corresponding \(y\) values for our plot using the function dnorm:

y = dnorm(x, mean = mu, sd = sigma)

I now do the same thing for the standard normal distribution:

mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)

And finally we plot both density functions next to each other:

par(mfrow = c(1,2))
plot(x,y,type="l",xlab="x",ylab="density")
plot(x.std,y.std,type="l",xlab="x",ylab="density")

We see that both curves look exactly the same; the only difference is the scale of the x and y axes.
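This is because any normal variable can be standardized: if \(X \sim N(\mu,\sigma)\), then \(Z = (X-\mu)/\sigma\) is standard normal. Numerically, the density of \(N(\mu,\sigma)\) equals the standard normal density of the z-scores divided by \(\sigma\):

```r
mu <- 10
sigma <- 2
x <- seq(mu - 4*sigma, mu + 4*sigma, by = 0.1)

# identical up to floating point error:
# dnorm(x, mu, sigma) = dnorm((x - mu)/sigma) / sigma
all.equal(dnorm(x, mean = mu, sd = sigma), dnorm((x - mu)/sigma)/sigma)  # TRUE
```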

random.sample = rnorm(1000,mu,sigma)
hist(random.sample,freq=F,ylim=c(0,0.2),breaks=30)
lines(x,y)

  1. Calculate the 10 percent quantile of the distribution. Store the value in a variable \(Q10\). Then calculate the proportion of numbers in your random sample from (b) that are smaller than or equal to \(Q10\). What do you find?
Q10 = qnorm(0.1,mean=mu,sd=sigma)
sum(random.sample<=Q10)/1000
## [1] 0.106

We find that approximately 10 percent of our random sample is smaller than \(Q10\). This is exactly the definition of a quantile!

Exercise 15

  p = seq(0.05,0.95,by=0.05)
  Q = qnorm(p,mean=0,sd=1)
  C = pnorm(Q,mean=0,sd=1)
  plot(p,C)

We find that \(p = C\). This means that the cumulative distribution function is the inverse of the quantile function.

The p-quantile \(q_p\) is defined as the number such that
\[P(X < q_p) = p.\] We define a quantile function \(Q(p) = q_p\). The cumulative distribution function \(F(x)\) gives us the value \[F(x) = P(X<x)\] Then \(F(q_p) = P(X < q_p) = p\)

and because

\[Q(p) = q_p\] it follows that

\[F(Q(p)) = F(q_p) = p.\] Here is a graphical representation of this relationship:

cdf.std = pnorm(x.std,mean=0,sd=1)
plot(x.std,cdf.std,type="l",xlab="x",ylab = "P(X<x) ")
abline(v=qnorm(0.1,mean=0,sd=1),lwd=2,col="dodgerblue")
abline(h = 0.1,col="slategray",lwd=2)
text(-1.8,0.3,expression(Q[10]),col="dodgerblue")
text(-3.3,0.2,expression(P(X < Q[10])==0.1),col="slategray")

  1. Indicate the smallest 5 percent of the distribution. In other words: Let \(X\) be a normally distributed random variable. For which range \((-\infty,c_1]\) do we get \(P(X \leq c_1) = 0.05\)?
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c1 = qnorm(0.05,0,1)
abline(v = c1)

  1. Indicate the most extreme 5 percent of the distribution. In other words: For which range \((-\infty,c_1] \cup [c_2,\infty)\) do we get \(P(X \leq c_1 \text{ or } X \geq c_2 ) = 0.05\)? [Hint: you can use the quantile function of \(R\) and the symmetry of the Normal distribution to simplify things]
mu.std = 0
sigma.std = 1
x.std = seq(mu.std-4*sigma.std,mu.std+4*sigma.std,by=0.1)
y.std = dnorm(x.std, mean = 0, sd = 1)
plot(x.std,y.std,type="l")
c2 = qnorm(c(0.025,0.975),0,1)
abline(v=c2)

We can add some simulated data if we want:

set.seed(1984)
n = 1000
random.sample = rnorm(n,0,1)
sum(random.sample < c2[1] | random.sample > c2[2])/n
## [1] 0.052
plot(x.std,y.std,type="l")
rug(random.sample)
rug(random.sample[random.sample < c2[1] | random.sample > c2[2]],col="red")

We see that roughly 5 percent fall into the extreme tails of the distribution. This is sensible, because we have defined the extreme part as the range below the 2.5 percent quantile and above the 97.5 percent quantile of the distribution (the two cutoffs stored in c2).

Week 7

Exercise 16

Remember that pnorm(x) = P(Z < x) and that 1 - pnorm(x) = P(Z > x).

  1. P(Z > 1.34)
x = seq(-4,4,by=0.01)
y = dnorm(x,0,1)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(2.3,0.05,"P(Z > 1.34)")

prob = 1 - pnorm(1.34,0,1)
prob
## [1] 0.09012267
  1. P(Z < 1.34)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.34,col = "red")
text(0,0.05,"P(Z < 1.34)")

prob = pnorm(1.34,0,1)
prob
## [1] 0.9098773
  1. P(Z < 2.15)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.15,col = "red")
text(0,0.05,"P(Z < 2.15)")

prob = pnorm(2.15,0,1)
prob
## [1] 0.9842224
  1. P(Z < 1.2)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 1.2,col = "red")
text(0,0.05,"P(Z < 1.2)")

prob = pnorm(1.2,0,1)
prob
## [1] 0.8849303
  1. P(0.52 < Z < 2.34)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = 2.34,col = "red")
abline(v = 0.52,col = "red")

text(1.5,0.3,"P(0.52 < 
Z < 
2.34)")

prob = pnorm(2.34,0,1) - pnorm(0.52,0,1)
prob
## [1] 0.2918899
  1. P(-2.34 < Z < -0.52)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -2.34,col = "red")
abline(v = -0.52,col = "red")

text(-1.5,0.3,"P(-2.34 < 
Z < 
-0.52)")

prob = pnorm(-0.52,0,1) - pnorm(-2.34,0,1) 
prob
## [1] 0.2918899
  1. P(Z < -0.93)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -0.93,col = "red")

text(-2,0.3,"P(Z < -0.93)")

prob = pnorm(-0.93,0,1) 
prob
## [1] 0.1761855
  1. P(-1.57 < Z < - 0.32)
plot(x,y,type="l",xlab="Z",ylab="density")
abline(v = -1.57,col = "red")
abline(v = - 0.32,col = "red")

text(-1.5,0.3,"P(-1.57 < 
Z < 
- 0.32)")

prob = pnorm(-0.32,0,1) - pnorm(-1.57,0,1) 
prob
## [1] 0.3162766

Exercise 17

We first calculate the mean for each genotype:

stickleback = read.csv("~/Dropbox/Teaching/StatisticsForBiology/chap03e3SticklebackPlates.csv")

mean.mm = mean(stickleback$plates[stickleback$genotype=="mm"])
mean.Mm = mean(stickleback$plates[stickleback$genotype=="Mm"])
mean.MM = mean(stickleback$plates[stickleback$genotype=="MM"])

Next we do the same thing for the standard deviation:

sd.mm = sd(stickleback$plates[stickleback$genotype=="mm"])
sd.Mm = sd(stickleback$plates[stickleback$genotype=="Mm"])
sd.MM = sd(stickleback$plates[stickleback$genotype=="MM"])

And now we can make 3 plots, one for each case:

x = seq(0,100,by = 0.01)

hist(stickleback$plates[stickleback$genotype=="mm"],freq=F,
     main =  "Genotype: mm",xlab = "Plates",ylab  = "Frequency ")
y = dnorm(x,mean.mm,sd.mm)
lines(x,y,lwd=2,col="red")

hist(stickleback$plates[stickleback$genotype=="Mm"],freq=F,
     main =  "Genotype: Mm",xlab = "Plates",ylab  = "Frequency ",
     xlim=c(0,100))
y = dnorm(x,mean.Mm,sd.Mm)
lines(x,y,lwd=2,col="red")

hist(stickleback$plates[stickleback$genotype=="MM"],freq=F,
     main =  "Genotype: MM",xlab = "Plates",ylab  = "Frequency ",
     xlim=c(0,100))
y = dnorm(x,mean.MM,sd.MM)
lines(x,y,lwd=2,col="red")

It can be seen clearly from the images that the normal distribution is not a very good approximation to reality in this case, in particular for the heterozygotes.

  1. We use the function t.test to calculate the upper and lower limit of the 95% CI:
t.t.mm = t.test(stickleback$plates[stickleback$genotype=="mm"])
t.t.Mm = t.test(stickleback$plates[stickleback$genotype=="Mm"])
t.t.MM = t.test(stickleback$plates[stickleback$genotype=="MM"])

CI.mm = t.t.mm$conf.int
CI.Mm = t.t.Mm$conf.int
CI.MM = t.t.MM$conf.int

We put it all in a dataframe to make comparison easier:

df = data.frame(means = c(mean.mm,mean.Mm,mean.MM),
                lower = c(CI.mm[1],CI.Mm[1],CI.MM[1]),
                upper = c(CI.mm[2],CI.Mm[2],CI.MM[2]),
                genotypes = c("mm","Mm","MM"))

Now we calculate the standard error of the mean for each genotype. We can use the standard deviation we have calculated in (a) for this. First we need the sample size for each group:

n.mm = sum(stickleback$genotype=="mm")
n.Mm = sum(stickleback$genotype=="Mm")
n.MM = sum(stickleback$genotype=="MM")

Then we can directly calculate the standard error:

SE.mm = sd.mm/(sqrt(n.mm))
SE.Mm = sd.Mm/(sqrt(n.Mm))
SE.MM = sd.MM/(sqrt(n.MM))

Adding the results to the data frame:

lower.approx = c(mean.mm-2*SE.mm,mean.Mm-2*SE.Mm,mean.MM-2*SE.MM)
upper.approx = c(mean.mm+2*SE.mm,mean.Mm+2*SE.Mm,mean.MM+2*SE.MM)

df.2 = mutate(df, lower.2SE = lower.approx,
              upper.2SE = upper.approx)

print(df.2)

We can see that the two methods give similar values. Small differences remain because the 2 SE rule relies on the normal approximation, whereas t.test() uses the t distribution, which has slightly heavier tails; the two approaches converge as the sample size grows.

Week 8

Exercise 18

geneContent <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e4XGeneContent.csv"))

propX = 1055/20290

propAut = 19235/20290

head(geneContent)
##   chromosome
## 1          X
## 2          X
## 3          X
## 4          X
## 5          X
## 6          X
geneContent$chromosome <- factor(geneContent$chromosome, levels = c("X","NotX"))


geneContentTable <- table(geneContent$chromosome)

barplot(geneContentTable)

ggplot(geneContent) + 
  geom_bar(mapping = aes(x = chromosome))

chisq.test( geneContentTable, p =  c(propX,propAut))
## 
##  Chi-squared test for given probabilities
## 
## data:  geneContentTable
## X-squared = 75.065, df = 1, p-value < 2.2e-16

Interpretation:

Our test yields a very small \(p\)-value, indicating that the null hypothesis

\(H_0:\) The proportion of genes on the X chromosome is the same as the proportion of base pairs on the X chromosome.

can be rejected in favor of the alternative hypothesis

\(H_A:\) The proportion of genes on the X chromosome is not the same as the proportion of base pairs on the X chromosome.
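For illustration (not part of the original solution), the test statistic can also be computed by hand from the observed and expected counts, using the objects geneContentTable, propX and propAut defined above:

```r
# observed gene counts (X vs. not X) from the table computed above
observed <- as.numeric(geneContentTable)

# expected counts under H0: total number of genes times the base-pair proportions
expected <- sum(observed) * c(propX, propAut)

# chi-squared statistic and its p-value (1 degree of freedom)
chi2 <- sum((observed - expected)^2 / expected)
chi2
pchisq(chi2, df = 1, lower.tail = FALSE)
```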

Exercise 19

birthDay <- read.csv(url("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter08/chap08e1DayOfBirth.csv"))
head(birthDay)
##      day
## 1 Sunday
## 2 Sunday
## 3 Sunday
## 4 Sunday
## 5 Sunday
## 6 Sunday

Put the days of the week in the correct order for tables and graphs.

birthDay$day <- factor(birthDay$day, levels = c( "Monday",
    "Tuesday", "Wednesday","Thursday","Friday","Saturday","Sunday"))

birthDayTable <- table(birthDay$day)
data.frame(Frequency = addmargins(birthDayTable))
##   Frequency.Var1 Frequency.Freq
## 1         Monday             41
## 2        Tuesday             63
## 3      Wednesday             63
## 4       Thursday             47
## 5         Friday             56
## 6       Saturday             47
## 7         Sunday             33
## 8            Sum            350

Bar graph of the birth data. The argument cex.names = 0.8 shrinks the names of the weekdays to 80% of the default size so that they fit in the graphics window – otherwise one or more names may be dropped.

barplot(birthDayTable, cex.names = 0.8)

shortNames = substr(names(birthDayTable), 1, 3)
barplot(table(birthDay$day), names = shortNames,
ylab = "Frequency", las = 1, col = "firebrick")

\(\chi^2\) goodness-of-fit test. The vector p is the expected proportions rather than the expected frequencies, and they must sum to 1 (R nevertheless uses the expected frequencies when calculating the test statistic).

chisq.test(birthDayTable, p = rep(1/7,times=7))
## 
##  Chi-squared test for given probabilities
## 
## data:  birthDayTable
## X-squared = 15.24, df = 6, p-value = 0.01847

We find a small \(p\)-value (< 0.05), so we can reject the null hypothesis

\(H_0:\) birthdays are uniformly distributed across weekdays.

in favor of the alternative hypothesis

\(H_A:\) birthdays are not uniformly distributed across weekdays.

From inspecting the data, it appears as if births are less common on Sundays. It could be that scheduled caesarean deliveries are not scheduled on weekends. Another reason could be that hospitals may sometimes be understaffed on weekends, and hence there might be a tendency to wait a bit longer before inducing labor in some non-critical cases.

Note, however, that the p-value is not extremely small. We would reject the null hypothesis if we choose \(\alpha = 0.05\) but not if we choose \(\alpha = 0.01\). This illustrates that the decision to reject or not reject a null hypothesis is strongly influenced by the degree of certainty you want to have.

Week 9

Exercise 20

stalkie <- read.csv(url("http://www.zoology.ubc.ca/~schluter/WhitlockSchluter/wp-content/data/chapter11/chap11e2Stalkies.csv"))


m0 = 9
M <- mean(stalkie$eyespan)
n <- length(stalkie$eyespan)
sigma <- sd(stalkie$eyespan)
SE <- sigma / sqrt(n)
test.statistic <- (M - m0) / SE

p.value = 2 * pt(-abs(test.statistic),df=8)
print(p.value)
## [1] 0.1327105
x=seq(-4,4,by=0.01)
y = dt(x,df=8)
plot(x,y,type="l")

abline(v = test.statistic,col="dodgerblue")
text(-3,0.2,"test statistic",col="dodgerblue")
text(2,0.3,"t-distribution, 
     with 8 degrees of freedom")

x1 = seq(-4,test.statistic,by=0.01)
y1 = dt(x1,df=8)

x2 = rev(x1)
y2 = y1*0

polygon(c(x1,x2),c(y1,y2),col = "red")
polygon(-c(x1,x2),c(y1,y2),col = "red")
text(-3.5,0.05,"P value",col="red")

The p value of the t-test is

test = t.test(stalkie$eyespan-9)

print(test$p.value)
## [1] 0.1327105

and the test statistic is

print(test$statistic)
##         t 
## -1.673772

We can summarize this in a table:

dat = data.frame(pValues = c(p.value,test$p.value),tStatistics = c(test.statistic,test$statistic))
colnames(dat) = c("P Value","Test Statistic")
rownames(dat) = c("by hand","using t.test()")
kableExtra::kable(dat) 
                 P Value    Test Statistic
by hand          0.1327105  -1.673772
using t.test()   0.1327105  -1.673772

Exercise 21

data = c(58.9, 7.8, 108.6, 44.8, 11.1, 19.2, 61.9, 30.5, 12.7, 35.8, 7.4, 39.3, 24.0, 62.1, 24.3, 55.3, 32.7, 65.3, -19.3, 7.6, -5.2, -2.1, 31.0, 69.0, 88.6, 39.5, 20.7, 89.0, 69.0, 64.9, 64.8)
  1. What is the sample size n?
n = length(data)

print(paste("The sample size is ",n))
## [1] "The sample size is  31"
  1. What is the mean of these data points? Remember to give the units.
m = mean(data)

print(paste("The mean shift is ",m, "meters."))
## [1] "The mean shift is  39.3290322580645 meters."
  1. What is the standard deviation of elevational range shift? (Give the units as well.)
sd = sd(data)

print(paste("The standard deviation of the shift is ",round(sd,digits=2)," meters."))
## [1] "The standard deviation of the shift is  30.66  meters."
  1. What is the standard error of the mean for these data?
SE = sd/sqrt(n)

print(paste("The standard error of the mean is ",round(SE,digits=2),"."))
## [1] "The standard error of the mean is  5.51 ."
  1. How many degrees of freedom will be associated with a confidence interval and a one-sample t-test for the mean elevation shift?

There are 30 (= \(n-1\)) degrees of freedom.

We can check with the t.test() function:

tt = t.test(data)
tt
## 
##  One Sample t-test
## 
## data:  data
## t = 7.1413, df = 30, p-value = 6.057e-08
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  28.08171 50.57635
## sample estimates:
## mean of x 
##  39.32903
  1. Calculate the 95 % confidence interval for the mean using these data.

We can either use our results from the t-test before:

tt$conf.int
## [1] 28.08171 50.57635
## attr(,"conf.level")
## [1] 0.95

or

upper = m + 2*SE
lower = m - 2*SE

c(lower,upper)
## [1] 28.31452 50.34355

A more precise version:

upper = m + qt(0.975,df = n - 1)*SE
lower = m + qt(0.025,df = n - 1)*SE
c(lower,upper)
## [1] 28.08171 50.57635
  1. For the one-sample t-test, write down the appropriate null and alternative hypotheses.

\(H_0:\) The mean elevational shift is 0 meters. \(H_A:\) The mean elevational shift is not 0 meters.

  1. Calculate the test statistic t for this test.

We can again just use t.test():

tt$statistic
##        t 
## 7.141309

or do it by hand:

mu.0 = 0 # as specified in the null hypothesis
t.stat = (m - mu.0)/SE
t.stat 
## [1] 7.141309
  1. What assumptions are necessary to do a one-sample t-test?

The data should be a random sample from the population and approximately normally distributed. We can check normality with a histogram:

hist(data)

A slightly more precise way is to do a QQ plot:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
qqPlot(data)

## [1]  3 19

As long as the points are within the dashed lines, you can safely do a t-test.
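A complementary numerical check (not shown above) is the Shapiro-Wilk test, shapiro.test(), which tests the null hypothesis that the data come from a normal distribution. A minimal sketch, using simulated values in place of the actual shift data:

```r
# Shapiro-Wilk test of the null hypothesis that the data are normal;
# `x` is simulated here -- substitute the actual shift data to check the real sample
set.seed(1)
x <- rnorm(31, mean = 39, sd = 31)   # 31 simulated values, roughly like the shifts
shapiro.test(x)$p.value              # a large p-value gives no evidence against normality
```

Keep in mind that with small samples this test has little power, so the graphical checks above remain important.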

Week 10

Exercise 22

  1. Calculate the P-value for this test as accurately as you can.

The most accurate way is to use the result of t.test() again:

tt$p.value
## [1] 6.056689e-08

We can also do it by hand. For this it is good to make a sketch first:

# we first calculate the null distribution of our t statistic
x = seq(-8,8,by=0.01)
y = dt(x,df = n-1)

# and plot it
plot(x,y,type="l",xlab="t",ylab="density")

# we add the observed t statistic
abline(v = t.stat,col="slategray")

We see in the figure that the observed t statistic is far away from the mode of the t distribution. The p-value should therefore be very small. To calculate it, we compute the area under the curve to the right of the statistic and double this value to account for equally extreme events in the left tail of the distribution (we perform a two-sided test). The function pt gives the area to the left of the test statistic, so we use 2*(1-pt(t.stat,df = n-1)) to get the p-value:

2*(1-pt(t.stat,df = n-1))
## [1] 6.056689e-08
  1. Did species change their highest elevation on average?

We choose a significance threshold of 5 percent, i.e., we accept a 5 percent probability of a false positive. The p-value is clearly smaller than that. Thus, the t-test shows a statistically significant change in the highest elevation. Because the p-value is so small, we can be very confident in this result, as it would also hold at much lower significance thresholds.
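As a quick sanity check in R, we can compare this p-value against several common thresholds at once:

```r
p = 6.056689e-08   # the p-value from the t-test above
p < c(0.05, 0.01, 0.001)
## [1] TRUE TRUE TRUE
```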

Exercise 23

Previously we used a one-sample t-test to calculate the confidence intervals. This test also tested the null hypothesis that the mean of each group is 0. If we want to compare means between groups, we need a two-sample t-test:

t.test(stickleback$plates[stickleback$genotype=="mm"],
       stickleback$plates[stickleback$genotype=="Mm"])
## 
##  Welch Two Sample t-test
## 
## data:  stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "Mm"]
## t = -32.001, df = 208.06, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -41.09355 -36.32416
## sample estimates:
## mean of x mean of y 
##  11.67045  50.37931
t.test(stickleback$plates[stickleback$genotype=="mm"],
       stickleback$plates[stickleback$genotype=="MM"])
## 
##  Welch Two Sample t-test
## 
## data:  stickleback$plates[stickleback$genotype == "mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -95.49, df = 167.89, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -52.16670 -50.05336
## sample estimates:
## mean of x mean of y 
##  11.67045  62.78049
t.test(stickleback$plates[stickleback$genotype=="Mm"],
       stickleback$plates[stickleback$genotype=="MM"])
## 
##  Welch Two Sample t-test
## 
## data:  stickleback$plates[stickleback$genotype == "Mm"] and stickleback$plates[stickleback$genotype == "MM"]
## t = -10.262, df = 207.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -14.78364 -10.01871
## sample estimates:
## mean of x mean of y 
##  50.37931  62.78049

We find that all t-tests yield a significant result (small p-values). This means that the null hypothesis that the means of two groups are equal can be rejected for all comparisons.

A shorter way is to use the function pairwise.t.test(). The first argument is the variable on which the t-tests should be performed. The second argument is the grouping factor: the categorical variable that separates the first argument into different groups.

 pairwise.t.test(stickleback$plates,stickleback$genotype)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  stickleback$plates and stickleback$genotype 
## 
##    mm      Mm     
## Mm < 2e-16 -      
## MM < 2e-16 1.5e-15
## 
## P value adjustment method: holm
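The last line of the output notes that the p-values are Holm-adjusted to account for multiple testing. A minimal sketch of what this adjustment does, using three hypothetical raw p-values (not the ones from this exercise): sort the p-values, multiply the i-th smallest by \(m - i + 1\), and enforce monotonicity.

```r
# Holm adjustment on three hypothetical raw p-values (not from this exercise)
p.raw = c(0.01, 0.02, 0.03)
p.adjust(p.raw, method = "holm")
## [1] 0.03 0.04 0.04
```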

How does this finding relate to the confidence intervals? The confidence intervals do not overlap, strongly suggesting that the means of the groups are indeed different (at the 95 percent level). Thus, the t-tests and the confidence intervals indicate the same result.

Week 11

Exercise 24

mouse_cooling <- read.csv("~/Dropbox/Teaching/StatisticsForBiology/mouse_cooling.csv")

mouse_cooling = mutate(mouse_cooling,diff = freshly.killed - reheated)   
ggplot(mouse_cooling) + 
  geom_histogram(aes(x = diff),bins=7)

The data are clearly right-skewed. This can also be seen with a QQ plot:

library(car)
qqPlot(mouse_cooling$diff)

## [1] 15  2

The null hypothesis is

\(H_0:\) The true mean of the differences is \(\mu_0 = 0\).

We try the following transformation: \(x' = \log_{100+x}(100)\), which in R is written log(100, 100+x) (the second argument of log is the base of the logarithm). Note that we add a constant of 100 so that the base stays positive. The new value of our null hypothesis is then \(\mu_0' = 1\), because for \(\mu_0 = 0\) we get \(\mu_0' = \log_{100+\mu_0}(100) = \log_{100}(100) = 1\).

library(car)
qqPlot(log(100,100+mouse_cooling$diff))

## [1] 15 18
ggplot(mouse_cooling) + 
  geom_histogram(aes(x = log(100,100+diff)),bins=7)

This looks much better now and we perform a t-test on these data:

results = t.test(log(100,100+mouse_cooling$diff),mu = 1)
results
## 
##  One Sample t-test
## 
## data:  log(100, 100 + mouse_cooling$diff)
## t = -1.4466, df = 18, p-value = 0.1652
## alternative hypothesis: true mean is not equal to 1
## 95 percent confidence interval:
##  0.9254516 1.0137511
## sample estimates:
## mean of x 
## 0.9696013

An alternative would be the sign test:

k = sum(mouse_cooling$diff<0)
results.sign = binom.test(k,n=19,p=0.5)
results.sign
## 
##  Exact binomial test
## 
## data:  k and 19
## number of successes = 7, number of trials = 19, p-value = 0.3593
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1628859 0.6164221
## sample estimates:
## probability of success 
##              0.3684211

In both cases we find no significant deviation from 0 in the differences. We thus cannot reject the null hypothesis.

Exercise 25

  1. \(\chi^2\) goodness-of-fit test. H0: Fish numbers per unit area follow a Poisson distribution.

  2. Two-sample t-test or a non-parametric alternative. H0: There is no difference in the mean tumor size change between the two groups.

  3. One-sample t-test or a non-parametric alternative. H0: The mean tumor size change during the treatment is 0.

  4. \(\chi^2\) goodness-of-fit test. H0: The number of smokers in a city is proportional to the number of people in that city.

  5. One-sample t-test. H0: Mean change in patient body mass during hospital stay is zero.

Exercise 26

health.expenditure <- read.csv("~/Dropbox/Teaching/StatisticsForBiology/health_expenditure.csv")
ggplot(health.expenditure) +
  geom_histogram(aes(x = Per.capita.health.expenditure.in.2010),bins=10)

Because it is highly right-skewed.

  1. The sample size is
n = dim(health.expenditure)[1]
n
## [1] 18
  1. Calculate the natural logarithm of each data point.
data = mutate(health.expenditure,log.expend=  log(Per.capita.health.expenditure.in.2010))
data$log.expend
##  [1] 6.761573 5.768321 5.476464 6.782192 6.156979 4.276666 4.094345 6.408529
##  [9] 3.850148 5.192957 5.509388 4.691348 7.436617 4.997212 5.888878 4.343805
## [17] 7.305860 6.522093
ggplot(data) +
  geom_histogram(aes(x = log.expend),bins=4)

qqPlot(data$log.expend)

## [1] 13  9
  1. The mean is
 M =mean(data$log.expend)
M
## [1] 5.636854
  1. The standard deviation is
SD = sd(data$log.expend)
SD
## [1] 1.110587

The standard error is

SE = SD/sqrt(n)
SE
## [1] 0.2617679
CI.log = t.test(data$log.expend)$conf.int
CI.log
## [1] 5.084572 6.189136
## attr(,"conf.level")
## [1] 0.95

or alternatively we can approximate it by:

CI.log.app = c(M-2*SE,M+2*SE)
CI.log.app
## [1] 5.113318 6.160390
CI.nonlog = exp(CI.log)
CI.nonlog
## [1] 161.5108 487.4248
## attr(,"conf.level")
## [1] 0.95
M.nonlog = mean(data$Per.capita.health.expenditure.in.2010)
hist(data$Per.capita.health.expenditure.in.2010,col="firebrick",main="",breaks=10)
abline(v = CI.nonlog,lty=2)
abline(v = M.nonlog,lwd=2)

Note how asymmetric the back-transformed CI is: this reflects the asymmetry of the data. Be careful, however, as this is an approximation! (The mean of log-transformed data is not the same as the logarithm of the mean of the data.)
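A quick numerical illustration of this last point, using three made-up values:

```r
x = c(1, 10, 100)   # three made-up values
log(mean(x))        # logarithm of the mean
## [1] 3.610918
mean(log(x))        # mean of the logarithms (the log of the geometric mean)
## [1] 2.302585
```

Exponentiating the mean of the logs therefore gives the geometric mean, not the arithmetic mean.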

Week 12

Exercise 27

no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)

n.rec = length(recycling)
n.no.rec = length(no.recycling)

group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))

dat = data.frame(var = c(recycling,no.recycling),group = group)

ggplot(dat) + 
  geom_histogram(aes(x = var,fill=group))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(dat) + 
  geom_histogram(aes(x = var,fill = group),bins=12) + 
  facet_wrap(~group,nrow=2) 

Both shape and spread seem to differ here, although it is difficult to tell based on the small sample sizes. A normal distribution can, however, be ruled out, especially for the recycling group.

ggplot(dat) + 
  geom_qq(aes(sample = var,col=group)) + 
  geom_qq_line(aes(sample = var,col=group))+ 
  facet_wrap(~group,nrow=2) 

qqPlot(recycling)

## [1] 20 19
qqPlot(no.recycling)

## [1] 22 21
  1. A t-test should not be used. Moreover, with these small sample sizes a permutation test will not work well (see Exercise 28). We opt for a Mann-Whitney U test.

c and d)

\(H_0:\) It is equally likely that a randomly selected value from one group will be less than or greater than a randomly selected value from a second population. \(H_A:\) It is not equally likely that a randomly selected value from one group will be less than or greater than a randomly selected value from a second population.

wilcox.test(var~group,data=dat)
## Warning in wilcox.test.default(x = c(4, 4, 4, 4, 4, 4, 4, 5, 8, 9, 9, 9, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  var by group
## W = 126.5, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
# alternative without dataframes
wilcox.test(recycling,no.recycling)
## Warning in wilcox.test.default(recycling, no.recycling): cannot compute exact p-
## value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  recycling and no.recycling
## W = 313.5, p-value = 0.01825
## alternative hypothesis: true location shift is not equal to 0
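For illustration, the W statistic reported by wilcox.test(recycling, no.recycling) can be reproduced by hand: W is the sum of the (mid)ranks of the first group in the pooled data, minus the minimum possible rank sum \(n_1(n_1+1)/2\).

```r
no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)

n1 = length(recycling)
ranks = rank(c(recycling, no.recycling))   # midranks are used for ties
W = sum(ranks[1:n1]) - n1 * (n1 + 1) / 2   # rank sum minus its minimum value
W
## [1] 313.5
```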

For comparison, a t-test in this case would yield the following (note that the t-test uses a different null hypothesis):

t.test(var~group,data=dat)
## 
##  Welch Two Sample t-test
## 
## data:  var by group
## t = -2.1447, df = 19.681, p-value = 0.04465
## alternative hypothesis: true difference in means between group No.Recycling and group Recycling is not equal to 0
## 95 percent confidence interval:
##  -34.9245559  -0.4663532
## sample estimates:
## mean in group No.Recycling    mean in group Recycling 
##                   9.454545                  27.150000

Both tests give similar results, which is a good sign. We therefore reject the null hypothesis. A randomly selected bin from the recycling group is likely to contain more paper than a randomly selected bin from the no-recycling group.

Exercise 28

We first calculate the actual difference of the means:

no.recycling = c(4,4,4,4,4,4,4,5,8,9,9,9,9,12,12,13,14,14,14,14,15,23)
recycling = c(4,5,8,8,8,9,9,9,12,14,14,15,16,19,23,28,40,43,129,130)

n.rec = length(recycling)
n.no.rec = length(no.recycling)

group = rep(c("Recycling","No.Recycling"),times=c(n.rec,n.no.rec))

dat = data.frame(var = c(recycling,no.recycling),group = group)
MeanR = mean(recycling)
MeanNR = mean(no.recycling)
diffMeans = MeanR - MeanNR
nPerm = 1000
permResult <- vector() # initializes an empty results vector
for(i in 1:nPerm){
    # step 1: permute the group labels
    permSample <- sample(dat$group, replace = FALSE)
    
    # step 2: calculate the difference between the permuted group means
    permMeanR <- mean(dat$var[permSample=="Recycling"])
    permMeanNR <- mean(dat$var[permSample=="No.Recycling"])
    permResult[i] <- permMeanR - permMeanNR
}

Plot the null distribution based on the permuted differences:

hist(permResult, right = FALSE, breaks = 100,col="darkred")

Note how the distribution looks odd. This is due to the small sample sizes: \(n_1 = 20, n_2 = 22\). More precisely, it is problematic because there are a few extreme values in the data set (129 and 130 in the recycling group). These extreme points have a strong impact on the empirical null distribution: if both end up in the same resampled group, they shift the whole distribution strongly to the left or right. This causes the bimodal distribution and is a sign that the sample size might be too small for a permutation test. Usually, however, this leads to low power and large p-values, so we can nevertheless use the null distribution to calculate an approximate p-value; we just need to be aware that our result will be conservative. The p-value is twice the proportion of permuted differences that are at least as large as the observed difference in means, diffMeans. The following code counts the permuted differences at or above diffMeans.

sum(as.numeric(permResult >= diffMeans))
## [1] 1

These commands obtain the fraction of permuted differences at or above diffMeans.

sum(as.numeric(permResult >= diffMeans)) / nPerm
## [1] 0.001

Finally, multiply by 2 to get the P-value for a two-sided test.

2 * ( sum(as.numeric(permResult >= diffMeans)) / nPerm )
## [1] 0.002

We can illustrate the permuted differences and add the observed difference to show the outcome of our test: the bright red bars indicate the proportion of permutation results that are at least as extreme as the observed one.

hist(permResult[permResult<=diffMeans],breaks=seq(-20,20,by=1),col="darkred",main="")
hist(permResult[permResult>diffMeans],breaks=seq(-20,20,by=1),add=T,col="red")
abline(v = diffMeans,col="red",lwd=2)

An alternative approach is

library(perm)
permTS(dat$var[dat$group=="Recycling"],dat$var[dat$group=="No.Recycling"],exact = T)
## 
##  Exact Permutation Test Estimated by Monte Carlo
## 
## data:  GROUP 1 and GROUP 2
## p-value = 0.004
## alternative hypothesis: true mean GROUP 1 - mean GROUP 2 is not equal to 0
## sample estimates:
## mean GROUP 1 - mean GROUP 2 
##                    17.69545 
## 
## p-value estimated from 999 Monte Carlo replications
## 99 percent confidence interval on p-value:
##  1.003509e-05 1.482735e-02

We again find that we can reject the null hypothesis.

Week 13

Exercise 29

dom = "http://www.zoology.ubc.ca/" 
path = "~schluter/WhitlockSchluter/wp-content/data/" 
dat = "chapter15/chap15e1KneesWhoSayNight.csv" 
location = paste(dom,path,dat,sep="") 
circadian <- read.csv(url(location),stringsAsFactors = FALSE) 
  1. Visualize the data:
ggplot(data = circadian) + 
  geom_violin(aes(x = treatment,
                  y =shift),
              fill=rgb(0.3,0.5,0.6,0.7),
              bw = 0.5) + 
  geom_point(aes(x = treatment,y = shift),
             col="darkslategray",size=3) +
  theme_minimal()

  1. Test for differences in means:
mod = lm(shift ~ treatment,data = circadian)

anova(mod)
## Analysis of Variance Table
## 
## Response: shift
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## treatment  2 7.2245  3.6122  7.2894 0.004472 **
## Residuals 19 9.4153  0.4955                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that not all groups have the same mean.

pairwise.t.test(circadian$shift,circadian$treatment)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  circadian$shift and circadian$treatment 
## 
##      control eyes  
## eyes 0.0088  -     
## knee 0.9418  0.0088
## 
## P value adjustment method: holm

We find that the “eyes” group is different from the other two groups, while “knee” does not differ from “control”. This can also be seen clearly in the plot of the raw data.

  1. The new study failed to replicate the older results, which had suggested that light treatment behind the knee can alter the circadian rhythm of melatonin production. Light treatment behind the knee does indeed have a different effect than light treatment to the eyes; however, the inclusion of a control group shows that only light treatment to the eyes had an effect relative to the control. This illustrates how important a proper control group is in experimental design: without one, it is nearly impossible to determine whether a treatment really has an effect.

Exercise 30

Load the data set:

dom = "http://www.zoology.ubc.ca/" 
path = "~schluter/WhitlockSchluter/wp-content/data/" 
dat = "chapter17/chap17e1LionNoses.csv" 
location = paste(dom,path,dat,sep="") 
lion <- read.csv(url(location)) 
head(lion) 
##   proportionBlack ageInYears
## 1            0.21        1.1
## 2            0.14        1.5
## 3            0.11        1.9
## 4            0.13        2.2
## 5            0.12        2.6
## 6            0.13        3.2
ggplot(data = lion) + 
  geom_point(aes(x = ageInYears,y = proportionBlack)) +
  theme_classic()

  1. The correlation coefficient can be calculated using the function cor:
cor(lion$ageInYears,lion$proportionBlack)
## [1] 0.7898272
model = lm( ageInYears~ proportionBlack,data=lion)
summary(model)
## 
## Call:
## lm(formula = ageInYears ~ proportionBlack, data = lion)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5449 -1.1117 -0.5285  0.9635  4.3421 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.8790     0.5688   1.545    0.133    
## proportionBlack  10.6471     1.5095   7.053 7.68e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.669 on 30 degrees of freedom
## Multiple R-squared:  0.6238, Adjusted R-squared:  0.6113 
## F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08

The coefficients of our model are:

print(model$coefficients)
##     (Intercept) proportionBlack 
##       0.8790062      10.6471194

The linear model for the regression line is therefore

\[ageInYears = 0.879 + 10.647 * proportionBlack\]
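As an illustration, this line can be used for prediction; here for a hypothetical lion whose nose is 50 percent black, using the coefficients printed above (the same value follows from predict() on the fitted model):

```r
# predicted age (in years) of a hypothetical lion with a 50% black nose,
# using the fitted coefficients from above
b0 = 0.8790062
b1 = 10.6471194
b0 + b1 * 0.5
## [1] 6.202566
```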

library(car)
qqPlot(model$residuals,dist="norm")

## [1] 30 27

A QQ plot of the residuals shows that they are approximately normally distributed.

plot(model$fitted,model$residuals)

The Tukey-Anscombe plot (residuals vs. fitted values) shows no (strong) deviation from constant residual variance.

We get confidence intervals with the function confint:

confint(model)
##                      2.5 %    97.5 %
## (Intercept)     -0.2826733  2.040686
## proportionBlack  7.5643082 13.729931

This can be directly seen from the CIs from (e) or the summary of the model:

summary(model)
## 
## Call:
## lm(formula = ageInYears ~ proportionBlack, data = lion)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5449 -1.1117 -0.5285  0.9635  4.3421 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.8790     0.5688   1.545    0.133    
## proportionBlack  10.6471     1.5095   7.053 7.68e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.669 on 30 degrees of freedom
## Multiple R-squared:  0.6238, Adjusted R-squared:  0.6113 
## F-statistic: 49.75 on 1 and 30 DF,  p-value: 7.677e-08

The p-value for the null hypothesis that the slope is 0 is very small, \(P = 7.7 \times 10^{-8}\). We can therefore reject the null hypothesis that \(\beta_1 = 0\).

This is easiest using geom_smooth() in ggplot:

ggplot(data = lion,aes(x = ageInYears,y = proportionBlack)) + 
  geom_point() +
  geom_smooth(method = "lm") +
  theme_classic()
## `geom_smooth()` using formula 'y ~ x'

See what happens when we do not specify method = "lm" in geom_smooth():

ggplot(data = lion,aes(x = ageInYears,y = proportionBlack)) + 
  geom_point() +
  geom_smooth() +
  theme_classic()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'